Understanding Data Skewness

L3 06 03 Understanding Data Skew V2

Introduction to Dataset

In the real world, you’ll see a lot of cases where the data is skewed. Skewed data means due to non-optimal partitioning, the data is heavy on few partitions. This could be problematic. Imagine you’re processing this dataset, and the data is distributed through your cluster by partition. In this case, only a few partitions will continue to work, while the rest of the partitions do not work. If you were to run your cluster like this, you will get billed by the time of the data processing, which means you will get billed for the duration of the longest partitions working. This isn’t optimized, so we would like to re-distribute the data in a way so that all the partitions are working.

Figure A. Time to process with non-optimal partitioning with skewed data

Introduction to Dataset Image

Figure B. Time to process with optimal partitioning with skewed data

Introduction to dataset image 2

Let’s recap what we saw in the video

In order to look at the skewness of the data:

Check for MIN, MAX and data RANGES
Examine how the workers are working
Identify workers that are running longer and aim to optimize it.